I I dry-read it the last time.. Next week. Yeah. Yeah. No. Uh mine's gonna be mostly using the off-line. But the actual stuff it's doing will be on-line. But it won't be very um processor intensive or memory intensive, I don't think. Don't think so. Yeah. Are we still gonna go for dumping it into a database? Are we still gonna dump it into a database? 'Cause if we are, I reckon we should all read our classes out of the database. It'll be so much easier. Well if we're gonna dump the part of it into a database anyway, we might as well dump all the fields we want into the database, calculate everything from there. Then we don't even have to worry that much about the underlying X_M_L_ representation. We can just query it. Well if we're gonna do that, we should try and store everything in in an X_M_L_ format as well. Yeah. Yeah. Well we don't even need to do that, 'cause if we got our information density calculated off-line, so all we do is treat the whole lot as one massive document. I mean they'll it's not gonna be so big that we can't load in a information density for every utterance. And we can just summarise based on that. I think you can do it on-line. I don't think there's really much point in doing like that when it's just gonna feed off in the end the information density measure basically. And that's all calculated off-line. So what you're really doing is sorting a list, is the p computationally hard part of it. Well like the ideas we're calculating are information density all off-line first for every utterance in the whole corpus, right? So what you do is you say if you're looking at a series of meetings, you just say well our whole document comprises of all these stuck together. And then all you have to do is sort them by j information density. Like maybe weighted with the search terms, and then extract them. I don't think it's too slow to do on-line, to be honest. Is that Yeah. Well, on the utterance level I was thinking. So the utterances with the highest like mean information density. Well the trouble with doing it on the word level is if you want the audio to synch up, you've got no way of getting in and extracting just that word. I mean it's impossible. For every single word? Oh, okay. Yeah. I don't think that will do it. We'll have to buffer it. Well the skimming's gonna use the importance. But like at first it's just gonna be I_D_F_. Well mostly skimming, yeah. Yeah. Well the nice thing about that is it will automatically be in sentences. Well more or less. So it will make more sense, and if you get just extract words. Yeah. I see it. But it'll need to be calculated at word level though because otherwise there won't be enough occurrences of the terms to make any meaningful sense. Yeah. Yeah, I reckon you can just mean it over the sentence. I think we should filter them. Maybe we should have like um a cut-off. So it a w word only gets a value if it's above a certain threshold. So anything that has less than say nought point five importance gets assigned to zero. Yeah, that's the other th Yeah. I think we'll have to buffer the audio. But I don't think it will be very hard. I think it would be like an hour or two's work. Like just build an another f wave file essentially. Yeah, I mean I bet there would be packages In memory, yeah. So just like unp there's bound to be like a media wave object or something like that. And just build one in memory. I don't know. I have no idea. But it must have like classes for dealing with files. And if it has classes for concatenating files, you can do it in memory. So Well what I think I might try and build is basically a class that you just feed it a linked list of um different wave-forms, and it will just string them all together with maybe, I don't know, tenth of a second silence in between each one or something like that. Normalise it, yeah. Oh yeah, yeah, we'll need that. We also really wanna be able to search by who's speaking as well. It doesn't matter, 'cause all the calculation's done off-line. That's easy. You just like create a new X_M_L_ document in memory. I don't think it's really that much of a problem because if it's too big, what we can do is just well all the off-line stuff doesn't really matter. And all we can do is just process a bit at a time. Like for summarisation, say we wanted a hundred utterances in the summary, just look at the meeting, take the top one hundred utterances in each other meeting. If it scores higher than the ones already in the summary so far, just replace them. And then you only have to process one meeting at a time. Okay, so maybe we should build a b store a mean measure for the segments and meetings as well? And speaker. Speaker and um topic segmenting we'll need as well. Yeah. Well yeah, and then it'll f preserve the order when it's displayed the Yeah. Yeah. Yeah, I think so. So we should basically make our own X_M_L_ document in memory that everyone's um module changes that, rather than the underlying data. And then have that X_M_L_ uh NITE X_M_L_ document tied to the interface. Well, you can make it in a file if you want. Mm-hmm. They are utterances, aren't they? The segments are utterances, aren't they? Yeah. Alright, okay. Well, that's easy. Well it's close enough, isn't it? It may not be exact every time, but it's a so sort of size we're looking for. Yeah, yeah. Yeah. But why don't we just write it as a new X_M_L_ file? Can NITE handle just loading arbitrary uh new like attributes and stuff? I mean, I would have thought they'd make it able to. Yeah. So why do we need to have two X_M_L_ trees in memory at once? The other thing is that would mean we'd be using their parser as well, which means we wouldn't have to parse anything, which be quite nice. 'Cause their parser is probably much faster than anything we've come up with anyway. Yeah, I mean we can process it in chunks if it gets too big basically. We can just process it all in chunks if it gets too big to load it into memory. I think we probably want to store Sorry. I think we probably want to store um a hierarchical information density as well. So like an informan mation density score for each meeting and each topic segment. 'Cause otherwise we'd be recalculating the same thing over and over and over again. Yeah. And that will obviously make it much easier to display. Well it may not for the whole meeting, but like Yeah, exactly. Yeah. Well, we can start off like that. Well I was gonna start off I've v got sort of half-way through implementing one that does just I_D_F_. And then just I can change that to work on whatever. Yeah. And it should be weighted by stuff like the hot spots and um the key-words in the search and stuff like that. Did he not say something about named entities? So I thought he said there wasn't very many. Yeah. Yeah. It's not T_F_I_D_F_, it's just inverse document frequency. 'Cause it's really easy to do basically. There's just like for a baseline really. Well, I'm half-way through. It's not working yet, but it will do. Um yeah. And then averaging it over the utterances. But it's not like um related to the corpus at all. It's just working on an arbitrary text file at the moment. No. It would be useful to know how everyone's gonna store their things though. Yeah. Yeah. Well I've got like a few hours free. Like after this. It's the most boring task. Yeah. Or at least um simple versions of them. So maybe we should try doing something really simple, like just displaying a whole meeting. And like just being able to scroll through it or something like that. Yeah. Are you free after this? How about Friday then. 'Cause I'm off all Friday. Uh Wednesday I've got a nine 'til twelve. Yeah, nothing in the afternoon. I've got nothing in the afternoon. So Okay. So you ha yeah. Where about, just in Appleton Tower? Uh I'll be in um the Appleton Tower anyway. Um well I'll be there from twelve. I've got some other stuff that needs done on Matlab, so if you're not there at twelve, I can just work on that. So Yeah. Why w Yeah. I'm just building a dictionary. Oh, mine's just gonna use the um hash map one in um Java. 'Cause I'm only gonna do it on small documents. It's just like bef until the information density is up and running. Just something to get give me something to work with. So it's only gonna use quite small documents, you see, to start with. Why does it need to be classified into like different segments? Can we just fill a second class with junk that we don't care about? Like, I don't know, copies of Shakespeare or something. 'Cause if what we're looking for is the um frequency statistics, I don't see how that would be changed by the classification. I the Well there maybe another tool available? Yeah. Um I can't remember who's got it. Might be WordNet. But one of these big corpuses has a list of stop words that you can download and they're just basically lists of really uninteresting boring words that we could filter out before we do that. It's like that's one the papers I read, that's um one things they did right at the beginning is they've got this big s stop-list and they just ignore all of those throughout the experiment. Yeah, I it would be useful for me as well. It uh I think that'd be useful for me as well. Yeah. Yeah. Well all you really wanna do is look into getting some sub-set of the ICSI corpus off the DICE machines. 'Cause I hate working on DICE. It's awful. Like so I can use my home machine. ha has a C_D_ burner though. has a C_D_ burner. Yeah. The right-hand corner, far right. Yeah. How big is it without um the WAV files and stuff? 'Cause I could just say at um going over S_C_P_ one night and just leave it going all night if I had to. It's yeah, I mean the wave data are obviously not gonna get off there completely. Really? Oh right? I'll see if I can S_C_P_ it, I suppose. I've got a Linux box and a Windows box. So Broad-band. Put it on to C_D_. I can if I get down I can put to C_D_. Yeah. I'm not sure if there's enough space. Is how much do we get? Really? Okay. Yeah, but I can do it from that session, can't I? You can compress it from a remote session and S_C_P_ it from the same session? Do you think? Yeah. Oh no no, I was thinking of SSHing just into some machine and then just SCPing it from there. Yeah. I mean it has to go through the gateway. But Can you not do that? Mm, I see. Yeah. So you could just But th first, uh how big are the chunks? How big are the chunks you're looking at? So quite small then. So you could just um you could use just the same thing we used to build the big dictionary. You just do that on-line 'cause that won't take long to build a little dictionary that big, will it. I mean just use the same tool that we use. Yeah. Yeah. It doesn't need ordered, no. Um well that's the t are you using T_F_I_D_F_ for the information density? Alright, okay. Like 'cause frequency would be useful, I think. But um depending on the context, the size, and what we consider a document in the sense of calculating T_F_I_D_F_ is gonna change. Which might need thinking about. I think it would be useful, yeah. Well you need the raw frequency as well. But um you also need how many times things occur within each document. And um what we consider a document's gonna depend on our context, I think. 'Cause if we're looking at the whole lot of meetings, we'll consider each meeting a document in sort of terms of this algorithm. And if we're viewing like say just a small topic segment you might look at even each utterance as a small document. Yeah, but the thing is um It's gonna need some th th thought of how we Actually maybe it doesn't actually matter. Maybe if you just do it once at the highest level, it it will be fine. But I was just thinking it might be difficult to calculate the T_F_I_D_F_ off-line for all the different levels we might want. 'Cause if we're gonna allow disjoint segments for example, then how are we gonna know what's gonna be in context at any given time? But I suppose if you just did it globally, treating a meeting as a document, it'd probably still be work out fine, because you'd only be comparing to ones within the context. Uh I don't know, I thought were you gonna use that in the end? The information density. Oh sorry, that's what I mean. Like um yeah, for each word or whatever, but across the whole lot is what I mean by highest level. Like across the whole corpus. Yeah, but you'd probably look at each meeting as a document. Mm possibly. Are they big enough to get anything meaningful out of? Well yeah, that is not it's not an issue. You just concatenate an X_M_L_ file together. but we still want to have like a notion of meetings for the user. Yeah, sure. Yeah, you just like whatever you want to look at, you just jam together into an X_M_L_ file and that's your meeting, even though bits of it may come from all over the place or whatever. I mean I don't see why that's really a big problem. So basically what you're saying is you can take an arbitrary amount of data and process it with the same algorithm. It doesn't matter conceptually what that data is. It could be a meeting. it could be two utterances. it could be a meeting plus half a meeting from somewhere else. I don't think it's very difficult though. I mean what you do is you just build an X_M_L_ file, and if you want it to get down to the utterances, you'd go to the leaves. And then if you wanted the next level up, you'd go to the parents of those and like just go from like the leaves inwards towards the branch to build up things like um you know, when you click on a segment, it's gonna have like words or whatever that are important. As long as like the algorithms are designed um with it in mind, I don't think it's a very big problem. Well like say you had um like say for a meeting, right, you've got like uh say a hierarchy that looks quite big, like this. And like the utterances come off of here maybe. Then when whatever your algorithm is doing, as long as when you're working with utterances, you go for all the leaves, like then if you need something next up, so like a topic segment, you'd go to here. But if you were looking at say this one, so only went like this. Right, so you it's same, you'd start with the leaves, and you go oh, I want a topic segment. So I go one layer up. See, and then if you're working with just a topic segment in there, it's the only thing you have to worry about. And like each time you want a higher level, you just need to go up the tree. And as long as your algorithm respects that, then we can just process any arbitrary X_M_L_ file with whatever hierarchical structure we want. A meeting, say, and that would be a topic segment. So I think as long as you build an algorithm that respects whatever structure's in the file, rather than imposing its own structure Well no, it doesn't have to be. But I mean it could be as many nodes as you want. Like this one could be deeper maybe, say. So then you'd start with all your utterances here, and when you go up to get topic segments, you go to here here here here here here here. That might be a bit confusing though 'cause you have things on different levels. Well Wednesday. Yeah. Yeah. So we'll see if we can get like a mini-browser just displays two things synched together of some kind. Yeah. Yeah. It'd be useful. I don't know who you see about that though. I d have no idea. I've probably got a reasonable amount because um everything on my DICE account can actually be deleted 'cause I store it all at home as well. Is that guaranteed to stay, the Maybe you should send a support form. Just say we want some web space. Listen to. Yeah. 'Cause that'd be really useful is if we had a big directory. Especially for transferring stuff. Having said that, are we allowed to take a copy of the ICSI corpus? Something we should probably ask before we do it.. Okay. Okay. No, me neither. Might be funny to see what is summarised the whole corpus as anyway. I think it'd be very useful. But We can just change the code. Is that it? That's quite good. Yeah. I could just use it with the frequency, I think, until the information density thing's finished. That would be really useful. If you're doing it in Java, could you um serialize the output as well as writing it to a file? If you're doing it in Java, could you serialize the um dictionary, yeah, as well as writing it to a file? It's really easy. I don't see why it'd be any more massive than the file. Yeah. It just saves you parsing the um file representation of it. And now 'cause I would be using it in Java anyway. So I'd just be building the data structure again. Yeah, but it seems like a bit silly to be parsing it over and over again kinda thing. I would've thought that um I think all the collections and things implement serializable already. I think they might do. Tonight I'll try and um I'll either work some more on uh the T_F_I_D_F_ summarizer or do the audio thing. Yeah. Do we have to demonstrate something next week? Yeah. Yeah, I know. I think it's 'cause we had to specify it ourselves that it's not as um like focus the specification of most um work we have to do. Yeah. Once we start doing it it will all become more or less obvious I think anyway.